Abstract


The architecture of a recurrent neural-network-based content-addressable
memory is detailed along with a companion training algorithm. The memory is
designed to store vectors composed of strings of the integers 1 through 9.
The performance characteristics of the model (memory capacity and basin
size) are presented.
1.  Introduction


In a seminal paper [1], Hopfield introduced an associative
(content-addressable) memory based on a recurrent neural network
architecture. Despite the dynamical nature of a recurrent network,
convergence to stable memory states for this model is guaranteed. Both the
memory capacity of the Hopfield network and the basins of attraction of its
memory states have been examined [2,3]. The network was demonstrated to
suffer from a limited memory capacity and very irregular basins. Although the
original Hopfield model is a binary state device, a model with a multilevel
capability was later demonstrated [4], opening the possibility of many other
architectural constructs. This report details a more complex neural-network
architecture and a companion training algorithm designed to convert the
network to a content-addressable memory. The memory is designed to store
integer vectors and to have large, error-free basins of attraction.
2.  Neural-Network  Model


The architecture of the neural network to be described can be characterized
as locally feedback (recurrent) and globally feedforward. The neural network
is composed of M stages, each containing N nodes. Within each stage, all
nodes are fully connected; that is, each node communicates information to
all other nodes. A node can exist in one of nine discrete states, defined by
the integers 1 through 9. Feedback is incorporated within each stage. The
original state of each node n, representing the input data set, is defined as
$V^0_{a,n}$. The subsequent states of each node are determined by

$$V^k_{a,n} = \left[\,0.5 + \frac{1}{N}\sum_{m=1}^{N} W_{j,n,m}\!\left(V^{k-1}_{a,m}\right)\right]^{T}, \qquad n = 1,\ldots,N, \quad k = 1,\ldots,3, \tag{1}$$

where

a is the designation of the input vector and its children;

T indicates that the function is to undergo a real to integer
transformation according to the following rule: if
$I \le V^k_{a,n} < I + 0.999\ldots$, then $V^k_{a,n} = I$, where I is an
integer;

j is a number, 1 through 9, that takes the value of $V^{k-1}_{a,m}$; thus,
the notation $W_{j,n,m}(V^{k-1}_{a,m})$ indicates that the value of
$W_{j,n,m}$ (the weight that links nodes n and m) is a function of the state
of node m ($j = V^{k-1}_{a,m}$) (henceforth, I abbreviate the weight term as
simply $W_{j,n,m}$); and

$W_{j,n,n} = 0$.

The upper limit on k was chosen because it was found that at k = 3,
$V^3_{a,n}$ invariably defines a stable point attractor. This is due, in
large measure, to the design of the training algorithm.

All neural network stages are cascaded. The output of one stage, that is, the
$V^3_{a,n}$ state, is the input to the next stage.
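
The node update of equation (1) and the cascading of stages can be sketched
in Python as follows. This is a minimal illustration, not the report's
implementation: the function names, the nested-list weight layout
`weights[j - 1][n][m]`, the in-place (asynchronous) sweep order, and the
clamping of results to the range 1 through 9 are all assumptions made for the
example.

```python
import math


def stage_update(state, weights, passes=3):
    """Run one stage for `passes` sweeps and return the resulting node states.

    state   -- list of N integers in 1..9
    weights -- weights[j - 1][n][m] is the weight from node m to node n
               when node m is in state j (hypothetical layout)
    """
    n_nodes = len(state)
    state = list(state)
    for _ in range(passes):
        # Assumption: nodes are processed one after another, so node n sees
        # the already-updated states of the nodes before it in the sweep.
        for n in range(n_nodes):
            # m == n is skipped because the self-weight is zero.
            total = sum(weights[state[m] - 1][n][m]
                        for m in range(n_nodes) if m != n)
            # Rule T applied to 0.5 + x amounts to rounding half up.
            value = math.floor(0.5 + total / n_nodes)
            # Assumption: clamp so the result is always a valid state.
            state[n] = min(9, max(1, value))
    return state


def cascade(state, stage_weights, passes=3):
    """Feed the final state of each stage into the next stage."""
    for weights in stage_weights:
        state = stage_update(state, weights, passes)
    return state
```

With a hand-built weight set whose row sums equal N times a target component,
every input is driven to that target in a single sweep, which is a convenient
sanity check before any training is attempted.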
3.  Training  Algorithm


The neural network is intended to be a content-addressable memory. Stored
vectors are to be point attractors. Associated with each attractor is to be a
regular, error-free basin of attraction. By regular I mean that within some
Euclidean distance of each point attractor, all vectors are trapped within its
basin.

Training proceeds in several steps. As the evolving family of neural-network
weights approaches its final state, increasing levels of sophistication are
incorporated into the algorithm. The goal is to determine the relationship
between node count and memory storage capacity (the scaling law) and the
characteristics of the basins of attraction, primarily their size. The neural
network is trained by being introduced, one at a time, to random vectors
chosen to cluster near the stored memories. The neural network weights are
then adjusted in response to each probing vector. This procedure continues
until the network is fully trained.

To the extent that any neural-network training algorithm can be considered a
standard, that standard is backpropagation [5]. In backpropagation, the
algorithm defines a measure of error between the actual and desired output of
the network and then propagates through the net a small correction to this
error (via gradient descent). The recurrent network of the present model has
its state updated asynchronously. Because each node is sequentially processed,
this process is potentially slow. To speed it up, a major requirement imposed
on the training algorithm is that it limits the number of cycles through the
net node set before a final, stable attractor state is achieved. We can
accomplish this (or, at least, attempt it) by forcing training to proceed not
in small increments as with backpropagation, but rather in large increments
designed to force the evolving network to an attractor state as rapidly as
possible. Unlike backpropagation, training proceeds in a forward direction.
The first stage of the network is fully trained, and then the second stage is
trained. This procedure continues for all stages. This approach makes it
especially convenient to determine the effect of stage count on network
performance.

One final note: because the activation function of equation (1) is
discontinuous, a training algorithm of the backpropagation type (one based
rigorously on gradient descent) is not feasible. The approach taken must be
more phenomenological.


3.1  Step  One  Training

Let b be the designation of one of the vectors to be stored in the neural
network. Let a be the designation of a randomly chosen vector. Define the
following term:

$$D^2_{b,a} = \sum_{n=1}^{N}\left(V^0_{b,n} - V^0_{a,n}\right)^2, \tag{2}$$

where $D^2_{b,a}$ is the square of the distance (Euclidean metric) from memory
vector b to test vector a and is the basis for determining the size for the
memory basin of attraction. The convention used throughout this report is
to designate an attractor state b as $V^0_{b,n}$. The assumption is that, when
the network is fully trained, $V^3_{b,n} = V^0_{b,n}$. To simplify all further
discussions, I consider only a single component of the above vectors, to be
designated n. It is a simple matter to extend all equations to the N terms of
the vector.
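
As a small illustration of the distance measure of equation (2), the squared
Euclidean distance between a stored vector and a test vector can be computed
as below. The function name and the use of plain Python lists are assumptions
made for this sketch.

```python
def squared_distance(memory, probe):
    """Squared Euclidean distance between a stored vector and a test vector."""
    if len(memory) != len(probe):
        raise ValueError("vectors must have the same length")
    return sum((b - a) ** 2 for b, a in zip(memory, probe))
```

For example, `squared_distance([1, 5, 9], [2, 5, 7])` sums the component
differences 1, 0, and 2 after squaring.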

In the discussion of the training process, several terms need to be defined.
First, generate the error value

$$E^u_{a,n} = \sum_{m=1}^{N} W_{j,n,m} - V^0_{b,n}\,N. \tag{3}$$

This is the difference between the desired value of component n of the point
attractor, $V^0_{b,n}$, and the component value generated by the randomly
chosen vector a, scaled by N, the node count. Note that initially feedback
effects are omitted. The initial training algorithm considers only a single
pass through equation (1). Thus, the final value of k from equation (1) is
to be 1. This is consistent with the desire to force the network into its
final attractor state as rapidly as possible.
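
The single-pass error value of equation (3) can be sketched as follows. The
weight layout `weights[j - 1][n][m]` (weight from node m to node n when node m
is in state j) and the function name are assumptions made for illustration;
the sum skips m = n because the self-weight is zero.

```python
def error_value(test_state, weights, n, target_component):
    """Error for component n: weight sum minus N times the desired value.

    test_state       -- list of N integers in 1..9 (the random test vector)
    weights          -- weights[j - 1][n][m], hypothetical nested-list layout
    target_component -- desired value of component n in the stored vector
    """
    n_nodes = len(test_state)
    total = sum(weights[test_state[m] - 1][n][m]
                for m in range(n_nodes) if m != n)
    return total - target_component * n_nodes
```

A positive result means the weight sum for node n is too large for the
desired component value, and a negative result means it is too small.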

Several histories are to be created and continually updated. The first is
$\Gamma_1$:

$$\Gamma_1 = \frac{B_1\,\Gamma_1' + \left|E^u_{a,n}\right|}{B_1 + 1}, \tag{4}$$

where $\Gamma_1'$ is the previously calculated value of $\Gamma_1$ and $B_1$
is a constant. $B_1$ is used to adjust the time constant of the equation; a
typical value is 10. Initially, $\Gamma_1 = 0$. Once training is well removed
from its beginning, $\Gamma_1$ provides a measure of the present performance
of the training algorithm compared to the more recent training cycles.
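
The error history behaves like a running average with a time constant set by
the constant in equation (4). A minimal sketch, assuming the history
accumulates the magnitude of the error (the function and parameter names are
hypothetical):

```python
def update_history(previous, error, time_constant=10.0):
    """Blend the previous history with the magnitude of the newest error."""
    return (time_constant * previous + abs(error)) / (time_constant + 1.0)
```

Starting from zero, repeated calls with a steady error magnitude approach that
magnitude, which is why the history is only informative once training is well
past its first cycles.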


A parallel set of node weights $\bar{W}$ is created. These are updated when
$|E^u_{a,n}| < \Gamma_1$:

$$\bar{W}_{j,n,m} = \frac{\dfrac{B_2}{\left|E^u_{a,n}\right|}\,\bar{W}'_{j,n,m} + W_{j,n,m}}{\dfrac{B_2}{\left|E^u_{a,n}\right|} + 1} \tag{5}$$

for all m, where $\bar{W}'$ is the previously calculated value of $\bar{W}$,
with $\bar{W}$ initialized to 0. $B_2$ is an adjustable constant typically set
to 0.1. When $|E^u_{a,n}|$ is a small value, $B_2$ is adjusted so that
$|E^u_{a,n}|/B_2$ is not less than, typically, 5.0. Equation (5) requires such
a modification. $\bar{W}$ represents an improved weight set when compared with
W. These terms are not incorporated into the neural net, but rather are used
as part of the training algorithm to periodically adjust the weights of the
neural network.


One final item is needed before a discussion of the training algorithm is
possible: a running weight average. This is a quantity that is updated with
every pass through the algorithm. It reflects, for each node, an adjusted
average about the weight values presently being invoked:


[Equations (6) through (9) are illegible in the source.]


where $\theta_{k,n,m}$ is an accumulating statistic on the number of times
$V_{a,m} = k$ for all previous cycles through the training algorithm.
$A_{i,n,m}$ reflects the weight average (connecting nodes n and m) about i,
weighted in favor of the points nearest to i and those points that have
previously been most often accessed. The logic behind this quantity is as
follows: It provides a target for adjusting the weights of the neural network,
a target that ensures a relatively smooth weight function as i transitions
from 1 through 9. And a smooth weight function helps to minimize the
difficulties associated with defining a large basin of attraction.
There is now sufficient introductory material to permit a presentation of
the phase one training algorithm. If $E^u_{a,n} > 0$ and
$W_{k,n,m} > A_{k,n,m}$, where $k = V^0_{a,m}$, then if
$(W_{k,n,m} - A_{k,n,m}) < E^u_{a,n}$,

$$E^u_{a,n} = E^u_{a,n} - W_{k,n,m} + A_{k,n,m}, \tag{10}$$

$$W_{k,n,m} = A_{k,n,m}. \tag{11}$$

If $(W_{k,n,m} - A_{k,n,m}) > E^u_{a,n}$,

$$W_{k,n,m} = W'_{k,n,m} - E^u_{a,n}, \tag{12}$$

where the prime indicates the previous (prior to updating) value of
$W_{k,n,m}$.

The above process is continued for additional, randomly chosen values of m
until either $E^u_{a,n}$ goes to zero (that is, eq (12) is accessed) or all
values of m are exhausted. If the above process is exhausted and
$|E^u_{a,n}| > 0$, then one additional step is required in the phase one
training: assuming $E^u_{a,n} > 0$, the weights $W_{k,n,m}$ are adjusted with
reference to $\bar{W}_{k,n,m}$ through equations (13) and (14), where
equation (14) defines a quantity $\Phi$ as a sum over m = 1, ..., N.

[Equations (13) and (14) are illegible in the source.]


In equations (10) to (12) above, the algorithm resolves the difference between
the output of the neural network and the desired point attractor by forcing
the weights via equation (11) or equation (12) to yield the proper output.
Where this cannot be fully achieved (based upon the eq (10) to (12) internal
criteria), equations (13) and (14) are invoked. This adjusts the weights based
on the values of $\bar{W}$, the historically best weight fit to the evolving
memory. The phase one process is continued until the improvement in network
performance, determined by equation (4), asymptotically approaches its final
value. At this point, step two training is invoked.
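
One reading of the positive-error branch described for equations (10) to (12)
is sketched below: weights that sit above their running-average target are
pulled down to that target until the remaining error can be absorbed by a
single weight. This is an interpretation for illustration only. The function
name, the flat per-node lists, and the treatment of an excess exactly equal to
the error (handled here like equation (12)) are assumptions, and the
negative-error case is not covered.

```python
def reduce_positive_error(error, weights_row, targets_row, order):
    """Lower weights toward their targets until a positive error is used up.

    weights_row[m] -- current weight from node m (for its present state)
    targets_row[m] -- running-average target for that weight
    order          -- sequence of m indices to visit (randomly chosen upstream)
    Returns the error that remains; weights_row is modified in place.
    """
    for m in order:
        if error <= 0:
            break
        excess = weights_row[m] - targets_row[m]
        if excess <= 0:
            continue  # weight is not above its target, leave it alone
        if excess < error:
            error -= excess                   # bookkeeping as in eq (10)
            weights_row[m] = targets_row[m]   # snap to target as in eq (11)
        else:
            weights_row[m] -= error           # absorb the rest as in eq (12)
            error = 0.0
    return error
```

If the function returns a nonzero value, every eligible weight has been moved
to its target and some error is left over, which is the situation the text
assigns to equations (13) and (14).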


3.2  Steps  Two,  Three,  and  Four  Training 

Step two training involves a more careful analysis of the performance of the
network after each step one training cycle and a modification of the standard
for calculating $\bar{W}$ (eq (5) is disabled). At the end of each step one
training cycle, the following additional concepts are invoked. With each
training cycle, the newly calculated weight value is tested (per eq (1))
against a set of $C_1$ random vectors (vector designation = a). Typically,
$C_1 = 20$. The average error is calculated:

$$E^c = \frac{1}{C_1}\sum_{a=1}^{C_1}\left|\left[\,0.5 + \frac{1}{N}\sum_{m=1}^{N} W_{j,n,m}\right]^{T} - V^0_{b,n}\right|. \tag{15}$$

An error history is generated where, if [illegible], then

$$\Gamma_2 = \frac{C_2\,\Gamma_2' + E^c}{C_2 + 1}, \tag{16}$$

where $C_2$ is a constant (typical value = 10), and $\Gamma_2'$ is the
previously calculated value of $\Gamma_2$. After sufficient training cycles
(typically 100) to ensure that equation (16) is no longer dominated by its
initialization value, the $\bar{W}$ terms can be adjusted: If and only if

$$E^c < \Gamma_2', \tag{17}$$

then

$$\bar{W}_{j,n,m} = W_{j,n,m}. \tag{18}$$

If after typically 200 training cycles, equation (17) is not satisfied, then

$$W_{j,n,m} = \bar{W}_{j,n,m} \tag{19}$$

for all j, n, and m, and training continues. As this training phase approaches
a limiting value (based on the eq (16) results), the third step of training
is entered. The third step is identical to the second step, except that the
error term of equation (15) is based on a single level of feedback. That is,
for equation (1) the final value for k is raised from 1 to 2. The fourth step
of training is identical to the third, except that the final value for k is 3.
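
The bookkeeping of step two can be read as a keep-the-best-and-roll-back
scheme: the current weights are saved whenever the measured average error
beats the error history, and the saved weights are restored after a long run
without improvement. The sketch below follows that reading. The class name,
method name, and the interpretation of "200 training cycles" as 200
consecutive cycles without improvement are assumptions made for illustration.

```python
import copy


class BestWeightTracker:
    """Keep the best weight set seen so far and roll back when training stalls."""

    def __init__(self, initial_weights, c2=10.0, warmup=100, patience=200):
        self.best = copy.deepcopy(initial_weights)
        self.c2 = c2              # time constant of the error history
        self.warmup = warmup      # cycles before the history is trusted
        self.patience = patience  # stalled cycles tolerated before rollback
        self.history = 0.0
        self.cycles = 0
        self.stale = 0

    def observe(self, weights, average_error):
        """Record one cycle; return the weights training should continue with."""
        previous = self.history
        self.history = (self.c2 * previous + average_error) / (self.c2 + 1.0)
        self.cycles += 1
        if self.cycles <= self.warmup:
            return weights
        if average_error < previous:
            # Improvement over the history: remember these weights.
            self.best = copy.deepcopy(weights)
            self.stale = 0
            return weights
        self.stale += 1
        if self.stale >= self.patience:
            # No improvement for too long: restore the remembered weights.
            self.stale = 0
            return copy.deepcopy(self.best)
        return weights
```

Steps three and four would reuse the same tracker unchanged; only the way the
average error is measured differs, since it is then taken after two or three
passes through the stage instead of one.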
4.  Neural  Network  Performance


Figures 1 to 3 summarize the performance of the neural network and its
training algorithm. On these figures, error rate refers to the failure of a
randomly chosen vector to converge to the closest memory (or point attractor).
A basin edge is defined by that set of points equidistant between some memory
state and its nearest neighbor. Each curve can then be interpreted as an
average shape of the basin of attraction about each intentional point
attractor.
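
The error-rate definition above can be made concrete with a short sketch. It
treats the trained network as an opaque `recall` callable that maps a test
vector to its final state; that callable, the function names, and the
tie-breaking rule (the first of several equally close memories wins) are
assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def closest_memory(vector, memories):
    """Index of the stored vector nearest to `vector`."""
    return min(range(len(memories)),
               key=lambda i: squared_distance(vector, memories[i]))


def error_rate(recall, memories, probes):
    """Fraction of probes that do not converge to their closest memory."""
    failures = 0
    for probe in probes:
        expected = memories[closest_memory(probe, memories)]
        if list(recall(probe)) != list(expected):
            failures += 1
    return failures / len(probes)
```

Grouping the probes by their distance from the nearest memory before calling
`error_rate` would give one point per distance bin, which is how a curve of
error rate against distance could be assembled.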

Each memory set and test vector is chosen randomly. All memory sets were
selected from a large base of vectors. The members of each set were chosen to
be as distant as practical (based on the distance definition of eq (2)) from
all other members, based on a random selection process. These figures
demonstrate what was observed throughout the algorithm test phase: model
performance is substantially independent of the vector set used. The curves
of figures 1 to 3 (the results of a single, randomly chosen family of
attractor states) are a reasonable representation of the performance of the
neural network and its training algorithm.
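
The report does not spell out how memory sets were drawn. One plausible
procedure, shown here purely as an assumption, is a greedy farthest-point
selection from a random pool: start from a random vector and repeatedly add
the candidate whose smallest squared distance to the vectors already chosen is
largest. The function names, pool size, and seed are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def pick_distant_memories(count, length, pool_size=200, seed=0):
    """Greedily pick `count` well-separated vectors with entries in 1..9."""
    rng = random.Random(seed)
    pool = [[rng.randint(1, 9) for _ in range(length)]
            for _ in range(pool_size)]
    chosen = [pool.pop(rng.randrange(len(pool)))]
    while len(chosen) < count:
        # Candidate whose nearest already-chosen vector is farthest away.
        best = max(pool,
                   key=lambda c: min(squared_distance(c, s) for s in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen
```

Calling `pick_distant_memories(4, 12)` returns four vectors of length 12; a
different seed gives a different family of attractor states.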

The results in figures 1 to 3 can be summarized succinctly. Figure 1
demonstrates the expected result: as the memory count increases, the basin
sizes shrink. Figures 2 and 3 demonstrate that little is to be gained from
using more than three stages. Figures 2 and 3 also demonstrate that memory
capacity scales poorly (if at all) with an increased vector length, at least
in the range of N up to 15. The curves also demonstrate that, depending on
memory count, each memory can have a sizable error-free basin of attraction.
Figure 1. Basin performance as function of memory count. Memory count = 2 to
5 and vector length = 12.

Figure 2. Basin performance as function of stage count. Memory count = 3,
vector length = 8, and stage count = 1 to 3.

Figure 3. Basin performance as function of stage count. Memory count = 4,
vector length = 15, and stage count = 1 to 3.

[Horizontal axis: distance from attractor to basin edge, 0.00 to 0.50.]
5.  Conclusion


I have demonstrated the feasibility of creating a multistate (as opposed to
binary) multistage neural-network-based content-addressable memory and
training algorithm. The model can generate large basins of attraction, albeit
for a very limited number of attractor states.